Abstract

Mini-batch algorithms have been proposed as a way to speed up stochastic convex optimization
problems. We study how such algorithms can be improved using accelerated gradient methods. We
provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to
obtain a significant speed-up, and propose a novel accelerated gradient algorithm, which deals with this
deficiency, enjoys a uniformly superior guarantee and works well in practice.



1 Introduction 

We consider a stochastic convex optimization problem of the form

$$\min_{w \in W} L(w),$$

where

$$L(w) = \mathbb{E}_z[\ell(w, z)],$$

and optimization is based on an empirical sample of instances $z_1, \ldots, z_m$. We focus on objectives $\ell(w, z)$
that are non-negative, convex and smooth in their first argument (i.e. have a Lipschitz-continuous gradient).
The classical learning application is when $z = (x, y)$ and $\ell(w, (x, y))$ is a prediction loss. In recent years,
there has been much interest in developing efficient first-order stochastic optimization methods for these
problems, such as stochastic mirror descent [2, 6] and stochastic dual averaging [9, 16]. These methods are
characterized by incremental updates based on subgradients $\partial \ell(w, z_i)$ of individual instances, and enjoy the
advantages of being highly scalable and simple to implement.

An important limitation of these methods is that they are inherently sequential, and so problematic to
parallelize. A popular way to speed up these algorithms, especially in a parallel setting, is via mini-batching,
where the incremental update is performed on an average of the subgradients with respect to several instances
at a time, rather than a single instance (i.e., $\frac{1}{b} \sum_{j=1}^{b} \partial \ell(w, z_{i+j})$). The gradient computations for each
mini-batch can be parallelized, allowing these methods to perform faster in a distributed framework (see for
instance [11]). Recently, [10] has shown that a mini-batching distributed framework is capable of attaining
asymptotically optimal speed-up in general (see also [1]).

Algorithm 1 Stochastic Gradient Descent with Mini-Batching (SGD)
Parameters: Step size $\eta$, mini-batch size $b$.
Input: Sample $z_1, \ldots, z_m$
$w_1 = 0$
for $i = 1$ to $n = m/b$ do
    Let $\ell_i(w_i) = \frac{1}{b} \sum_{t=b(i-1)+1}^{bi} \ell(w_i, z_t)$
    $w'_{i+1} := w_i - \eta \nabla \ell_i(w_i)$
    $w_{i+1} := P_W(w'_{i+1})$
end for
Return $\bar{w} = \frac{1}{n} \sum_{i=1}^{n} w_i$
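As an illustration, Algorithm 1 can be sketched in plain Python. This is a minimal sketch rather than the authors' implementation: the squared loss, the toy data generator `make_toy_sample`, the ball radius used for the projection, and all function names are assumptions introduced only for this example.

```python
import random


def make_toy_sample(m, seed=0):
    """Hypothetical noiseless data: y = <w_true, x> with w_true = (1, -2)."""
    rng = random.Random(seed)
    w_true = [1.0, -2.0]
    sample = []
    for _ in range(m):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = sum(a * c for a, c in zip(w_true, x))
        sample.append((x, y))
    return sample


def squared_loss_grad(w, z):
    """Gradient in w of the (assumed) loss 0.5 * (<w, x> - y)^2 for z = (x, y)."""
    x, y = z
    residual = sum(a * c for a, c in zip(w, x)) - y
    return [residual * c for c in x]


def project_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius}, playing the role of P_W."""
    norm = sum(a * a for a in w) ** 0.5
    if norm <= radius:
        return list(w)
    return [a * radius / norm for a in w]


def minibatch_sgd(sample, dim, eta, b, radius, grad=squared_loss_grad):
    """Mini-batch SGD as in Algorithm 1; returns the average of w_1, ..., w_n."""
    n = len(sample) // b
    w = [0.0] * dim                      # w_1 = 0
    w_sum = [0.0] * dim
    for i in range(n):
        w_sum = [s + a for s, a in zip(w_sum, w)]   # accumulate w_i
        batch = sample[i * b:(i + 1) * b]
        g = [0.0] * dim                  # gradient of the mini-batch average loss
        for z in batch:
            for j, gj in enumerate(grad(w, z)):
                g[j] += gj / b
        w = project_ball([a - eta * gj for a, gj in zip(w, g)], radius)
    return [s / n for s in w_sum]


sample = make_toy_sample(2000)
w_bar = minibatch_sgd(sample, dim=2, eta=0.25, b=10, radius=10.0)
```

The per-instance gradients inside the inner loop are independent of one another, which is the part that a distributed implementation would compute in parallel.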

A parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. In
a deterministic optimization setting and for general smooth convex functions, these methods enjoy a rate of
$O(1/n^2)$ (where $n$ is the number of iterations) as opposed to $O(1/n)$ using standard methods. However, in
a stochastic setting (which is the relevant one for learning problems), the rates of both approaches have an
$O(1/\sqrt{n})$ dominant term in general, so the benefit of using accelerated methods for learning problems is not
obvious.



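Algorithm 2 Accelerated Gradient Method (AG)
Parameters: Step sizes $(\gamma_i, \beta_i)$, mini-batch size $b$
Input: Sample $z_1, \ldots, z_m$
$w_1 = w_1^{ag} = 0$
for $i = 1$ to $n = m/b$ do
    Let $\ell_i(w) := \frac{1}{b} \sum_{t=b(i-1)+1}^{bi} \ell(w, z_t)$
    $w_i^{md} := \beta_i^{-1} w_i + (1 - \beta_i^{-1}) w_i^{ag}$
    $w'_{i+1} := w_i^{md} - \gamma_i \nabla \ell_i(w_i^{md})$
    $w_{i+1} := P_W(w'_{i+1})$
    $w_{i+1}^{ag} \leftarrow \beta_i^{-1} w_{i+1} + (1 - \beta_i^{-1}) w_i^{ag}$
end for
Return $w_n^{ag}$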
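The update order of Algorithm 2 can likewise be sketched in plain Python. This is a hedged sketch, not the authors' code: it assumes the schedule $\beta_i = (i+1)/2$ and $\gamma_i = \gamma i^p$, a squared loss, a hypothetical toy data generator, and a Euclidean ball as the domain; every function name below is introduced only for illustration.

```python
import random


def make_toy_sample(m, seed=0):
    """Hypothetical noiseless data: y = <w_true, x> with w_true = (1, -2)."""
    rng = random.Random(seed)
    w_true = [1.0, -2.0]
    sample = []
    for _ in range(m):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        sample.append((x, sum(a * c for a, c in zip(w_true, x))))
    return sample


def squared_loss_grad(w, z):
    """Gradient in w of the (assumed) loss 0.5 * (<w, x> - y)^2 for z = (x, y)."""
    x, y = z
    residual = sum(a * c for a, c in zip(w, x)) - y
    return [residual * c for c in x]


def project_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius}, playing the role of P_W."""
    norm = sum(a * a for a in w) ** 0.5
    if norm <= radius:
        return list(w)
    return [a * radius / norm for a in w]


def accelerated_gradient(sample, dim, gamma, p, b, radius, grad=squared_loss_grad):
    """Accelerated mini-batch method with beta_i = (i + 1) / 2, gamma_i = gamma * i**p."""
    n = len(sample) // b
    w = [0.0] * dim          # w_1
    w_ag = [0.0] * dim       # w_1^ag
    for i in range(1, n + 1):
        beta_inv = 2.0 / (i + 1)
        gamma_i = gamma * i ** p
        # w_i^md: convex combination of the current iterate and the aggregate point
        w_md = [beta_inv * a + (1.0 - beta_inv) * c for a, c in zip(w, w_ag)]
        batch = sample[(i - 1) * b:i * b]
        g = [0.0] * dim      # mini-batch gradient evaluated at w_i^md
        for z in batch:
            for j, gj in enumerate(grad(w_md, z)):
                g[j] += gj / b
        # gradient step taken from w_i^md, followed by the projection step
        w = project_ball([a - gamma_i * gj for a, gj in zip(w_md, g)], radius)
        # w_{i+1}^ag: update of the aggregate point, which is what gets returned
        w_ag = [beta_inv * a + (1.0 - beta_inv) * c for a, c in zip(w, w_ag)]
    return w_ag


sample = make_toy_sample(4000)
w_ag = accelerated_gradient(sample, dim=2, gamma=0.125, p=1.0, b=20, radius=10.0)
```

In this sketch the only difference from mini-batch SGD is the bookkeeping of the two extra sequences; the gradient work per iteration is the same.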



In this paper, we study the application of accelerated methods for mini-batch algorithms, and provide 
theoretical results, a novel algorithm, and empirical experiments. The main resulting message is that by 
using an appropriate accelerated method, we obtain significantly better stochastic optimization algorithms 
in terms of convergence speed. Moreover, in certain regimes acceleration is actually necessary in order to
allow significant speedups. The potential benefit of acceleration to mini-batching has been briefly noted
in [4], but here we study this issue in much more depth. In particular, we make the following contributions:

• We develop novel convergence bounds for the standard gradient method, which refine the results of
[10, 4] by being dependent on $L(w^*) = \inf_{w \in W} L(w)$, the expected loss of the best predictor in our class.
For example, we show that in the regime where the desired suboptimality is comparable to or larger than
$L(w^*)$, including in the separable case $L(w^*) = 0$, mini-batching does not lead to significant speed-ups
with standard gradient methods.

• We develop a novel variant of the stochastic accelerated gradient method [5], which is optimized for a
mini-batch framework and implicitly adaptive to $L(w^*)$.

• We provide an analysis of our accelerated algorithm, refining the analysis of [5] by being dependent
on $L(w^*)$, and show how it always allows for significant speed-ups via mini-batching, in contrast to
standard gradient methods. Moreover, its performance is uniformly superior, at least in terms of
theoretical upper bounds.

• We provide an empirical study, validating our theoretical observations and the efficacy of our new
method.



2 Preliminaries

We consider stochastic convex optimization problems over some convex domain $W$, using an i.i.d. sample
$z_1, \ldots, z_m \in Z$ drawn from some fixed distribution. Here, we take $W$ to be a convex subset of a
Euclidean space, and use $\|w\|$ to denote the standard Euclidean norm. In the Appendix, we state and prove
the results in a more general setting, where $W$ is a convex subset of a Banach space and $\|w\|$ can be an
arbitrary norm.

Throughout this paper we assume that the instantaneous loss $\ell : W \times Z \to \mathbb{R}$ is convex in its first argument
and non-negative. We further assume that the loss is $H$-smooth in its first argument for each $z \in Z$. That
is, for every $z \in Z$ and $w, w' \in W$,

$$\|\nabla \ell(w, z) - \nabla \ell(w', z)\| \le H \|w - w'\|$$

(for the more general Banach space case, the norm on the left hand side is the dual norm). Let us denote

$$L(w) := \mathbb{E}_z[\ell(w, z)].$$

We wish to minimize $L(w)$ over the convex domain $W$. We will provide guarantees on $L(w)$ relative to $L(w^*)$
at some $w^* \in W$, where the guarantees also depend on $\|w^*\|$. We could choose $w^* := \arg\min_{w \in W} L(w)$,
though our results hold for any $w^* \in W$, and in some cases we might choose to compete with a low-norm
$w^*$ that is not optimal in $W$.

The behavior of the accelerated gradient method also depends on the radius of $W$, defined as:

$$D := \sup_{w \in W} \|w\|.$$

We discuss two stochastic optimization approaches to deal with this problem: stochastic gradient descent
(SGD), and accelerated gradient methods (AG). In a mini-batch setting, both approaches iteratively average
sub-gradients with respect to several instances, and use this average to update the predictor. However, the
update is done in different ways. In the Appendix, we also provide the form of the update in the more
general mirror descent setting, where $\|w\|$ is an arbitrary norm.

The stochastic gradient descent algorithm is summarized as Algorithm 1. In the pseudocode, $P_W$ refers to
the projection onto the ball $W$ (under the Euclidean distance). The accelerated gradient method (e.g., [5])
is summarized as Algorithm 2.

In terms of existing results, for the SGD algorithm we have [4, Section 5.1]

$$\mathbb{E}[L(\bar{w})] - L(w^*) \le O\left(\sqrt{\frac{1}{m}} + \frac{b}{m}\right),$$

whereas for an accelerated gradient algorithm, we have [5]

$$\mathbb{E}[L(w_n^{ag})] - L(w^*) \le O\left(\sqrt{\frac{1}{m}} + \frac{b^2}{m^2}\right),$$

where in both cases the dependence on $D$, $H$ and $\|w^*\|$ is suppressed. The above bounds suggest that, as
long as $b = o(\sqrt{m})$, both methods allow us to use a large mini-batch size $b$ without significantly degrading
the performance of either method. This allows the number of iterations $n = m/b$ to be smaller, potentially
resulting in faster convergence speed. However, these bounds do not show that accelerated methods have a
significant advantage over the SGD algorithm, at least when $b = o(\sqrt{m})$, since both have the same first-order
term $1/\sqrt{m}$. To understand the differences between these two methods better, we will need a more refined
analysis, to which we now turn.



3 Convergence Guarantees 

The following theorems provide a refined convergence guarantee for the SGD algorithm and the AG algorithm, 
which improves on the analysis of [10, 4, 5] by being explicitly dependent on L(w*), the expected loss of the 
best predictor w* in W. 



Theorem 1. For any w* S W, using Stochastic Gradient Descent with a step size ofrj = min ■ 
we have: 



M|W||^ 

1+ 



./fl-llw* II 

Y i,(w*)i 



Note that the radius $D$ does not appear in the above bound, which depends only on $\|w^*\|$. This means that
$W$ could be unbounded, perhaps even the entire space, and a projection step for SGD is not really crucial.
The step size, of course, still depends on $\|w^*\|$.

Theorem 2. For any $w^* \in W$, using Accelerated Gradient with step size parameters $\beta_i = \frac{i+1}{2}$, $\gamma_i = \gamma i^p$
where

-y-min/^ J "H-^H' ( " ( "-'H' ^ 1 m 



and

$$p = \min\left\{\max\left\{\frac{\log(b)}{2\log(n-1)}, \frac{\log\log(n)}{2\left(\log(b(n-1)) - \log\log(n)\right)}\right\}, 1\right\}, \qquad (2)$$



as long as n> 783, we have: 



E [L(w-)1 - L(w*) < U7xh^^^tl^ + 367if||wir/^Z^i 546gi^^ 7log(n) 5H\\^^f 

^ iHD^Liw*) 367 HD^ M6H ^/]og(nj 5HD^ 
~ \ bn y/bn bn n? 

Unlike for SGD, notice that the bound for the AG method above does depend on $D$, and a projection step
is necessary for our analysis. However, it is worth noting that $D$ only appears in terms of order at least $1/n$,
and appears only mildly in the $1/(\sqrt{b}\, n)$ term, suggesting some robustness to the radius $D$.

We emphasize that Theorem 2 gives more than a theoretical bound: it actually specifies a novel accelerated
gradient strategy, where the step size $\gamma_i$ scales polynomially in $i$, in a way dependent on the minibatch size $b$
and $L(w^*)$. While $L(w^*)$ may not be known in advance, it does have the practical implication that choosing
$\gamma_i \propto i^p$ for some $p < 1$, as opposed to just choosing $\gamma_i \propto i$ as in [5], might yield superior results.
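The exponent $p$ of the polynomial step-size schedule can be computed directly. The sketch below assumes the expression for $p$ reads $\min\{\max\{\log b / (2\log(n-1)),\ \log\log n / (2(\log(b(n-1)) - \log\log n))\}, 1\}$, and that the schedule is $\gamma_i = \gamma i^p$; the function names are hypothetical.

```python
import math


def step_exponent(b, n):
    """Exponent p for mini-batch size b and n iterations (assumes n >= 3)."""
    first = math.log(b) / (2.0 * math.log(n - 1))
    loglog_n = math.log(math.log(n))
    second = loglog_n / (2.0 * (math.log(b * (n - 1)) - loglog_n))
    return min(max(first, second), 1.0)


def step_sizes(gamma, p, n):
    """Polynomial schedule gamma_i = gamma * i**p for i = 1, ..., n."""
    return [gamma * i ** p for i in range(1, n + 1)]


p_small_batch = step_exponent(b=1, n=1000)          # driven by the log-log term
p_large_batch = step_exponent(b=999 ** 2, n=1000)   # capped at 1
schedule = step_sizes(0.5, 1.0, 3)
```

Under this reading, small mini-batches give an exponent close to zero (nearly constant step sizes), while very large mini-batches push $p$ up to the cap of 1, which is the linear schedule used in [5].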
We now provide a proof sketch of Theorems 1 and 2. A more general statement of the theorems as well as
a complete proof can be found in the Appendix.

The key observation used for analyzing the dependence on $L(w^*)$ is that for any non-negative $H$-smooth
convex function $f : W \to \mathbb{R}$, we have [13]:

$$\|\nabla f(w)\| \le \sqrt{4 H f(w)} \qquad (3)$$

This self-bounding property tells us that the norm of the gradient is small at a point if the loss is itself
small at that point. This self-bounding property has been used in [14] in the online setting and in [13] in the
stochastic setting to get better (faster) rates of convergence for non-negative smooth losses. The implications
of this observation are that for any $w \in W$, $\|\nabla L(w)\| \le \sqrt{4 H L(w)}$ and, for all $z \in Z$, $\|\nabla \ell(w, z)\| \le \sqrt{4 H \ell(w, z)}$.
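The self-bounding property (3) is easy to check numerically on scalar examples. The sketch below is only an illustration: the two test functions (a quadratic, which is 1-smooth, and the logistic loss, which is 1/4-smooth) and the helper name `check_self_bounding` are choices made here, not taken from the paper.

```python
import math


def check_self_bounding(f, f_prime, H, points, tol=1e-12):
    """True if |f'(u)| <= sqrt(4 * H * f(u)) at every test point u."""
    return all(abs(f_prime(u)) <= math.sqrt(4.0 * H * f(u)) + tol for u in points)


def quadratic(u):
    """0.5 * u^2: non-negative, convex and 1-smooth."""
    return 0.5 * u * u


def quadratic_prime(u):
    return u


def logistic(u):
    """log(1 + exp(-u)): non-negative, convex and 1/4-smooth."""
    return math.log1p(math.exp(-u))


def logistic_prime(u):
    return -1.0 / (1.0 + math.exp(u))


grid = [k / 10.0 for k in range(-100, 101)]
quadratic_ok = check_self_bounding(quadratic, quadratic_prime, 1.0, grid)
logistic_ok = check_self_bounding(logistic, logistic_prime, 0.25, grid)
```

In both cases the gradient magnitude shrinks together with the function value, which is what lets the variance term in the analysis be controlled by the loss itself.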



Proof sketch for Theorem 1. The proof for the stochastic gradient descent bound is mainly based on the 
proof techniques in [5] and its extension to the mini-batch case in [10]. Following the line of analysis in [5], 
one can show that 



E 



n-1 



i(w*) < ^ 5^E [||VL(w,) - V£i{wi 



+ 



2r7(Ti-l) 



In the case of [5], $\mathbb{E}[\|\nabla L(w_i) - \nabla \ell_i(w_i)\|^2]$ is bounded by the variance, and that leads to the final bound
provided in [5] (by setting $\eta$ appropriately). As noticed in [10], in the minibatch setting we have
$\nabla \ell_i(w_i) = \frac{1}{b} \sum_{t=b(i-1)+1}^{bi} \nabla \ell(w_i, z_t)$ and so one can further show that



E 



L i=l 



n—1 ib 



L{^*)<WJ^)Y. E E||VL(w,)-V£(w„.,)f +2^ 



(4) 



i=l t= 



In [10], each of $\|\nabla L(w_i) - \nabla \ell(w_i, z_t)\|$ is bounded by $\sigma_0$ and so, setting $\eta$ appropriately, the mini-batch bound provided
there is obtained. In our analysis we further apply the self-bounding property to (4) and get that



E 



^E^(-^) 



i(w'^)<5^EEW-^ 



2?7(n-l) 



rearranging and setting $\eta$ appropriately gives the final bound.



□ 



Proof sketch for Theorem 2. The proof of the accelerated method starts in a similar way as in [5]. For
the $\gamma_i$'s and $\beta_i$'s mentioned in the theorem, following similar lines of analysis as in [5] we get the preliminary
bound



E[L(w^^^)]-L(w*)< 



27 



(n-1) 



P-i-i 



7(n - l)P+i 



In [5] the step size is $\gamma_i = \gamma (i + 1)/2$ and $\beta_i = (i + 1)/2$, which effectively amounts to $p = 1$, and the
analysis then proceeds similarly to the stochastic gradient descent analysis. Furthermore, each
$\mathbb{E}\|\nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md})\|^2$ is assumed to be bounded by some constant, which leads to the final bound
provided in [5] by setting $\gamma$ appropriately. On the other hand, we first notice that due to the mini-batch
setting, just like in the proof of stochastic gradient descent,



E [L(w-g)]-L(w-) 



2p 



ib 

E 

i>(i-i)+i 



E 



VL(wf d) - V£(w'"^ zt) 



md 



7(n-l)i'+i 
Using smoothness, the self-bounding property, and some manipulations, we can further get the bound

n-l 

E [L(w^s)] - L(w*) < ^^^^ Yl [^K')] - + "^''^'"^'t^""''^" 

i=l 



7(n-l)P+i -r (,(n-l) 

Notice that the above recursively bounds $\mathbb{E}[L(w_n^{ag})] - L(w^*)$ in terms of $\sum_{i=1}^{n-1} (\mathbb{E}[L(w_i^{ag})] - L(w^*))$. While
unrolling the recursion all the way down to 2 does not help, we notice that for any $w \in W$, $L(w) - L(w^*) \le 12HD^2 + 3L(w^*)$.
Hence we unroll the recursion to $M$ steps and use this inequality for the remaining sum.
Optimizing over the number of steps up to which we unroll, and also optimizing over the choice of $\gamma$, we get the
bound

E [L(w-)] - L(w^) < ^[^^^^Bf^ + '-^^m^W^iKn - 1))^ + fPg 



I AHD^ I 36HD^ log(") 
+ (n-l)P+i + b{n-l) (6(„_i))^ 



Using the $p$ given in the theorem statement, and a few simple manipulations, gives the final bound. □



4 Optimizing with Mini-Batches

To compare our two theorems and understand their implications, it will be convenient to treat $H$ and $D$
as constants, and focus on the more interesting parameters of sample size $m$, minibatch size $b$, and optimal
expected loss $L(w^*)$. Also, we will ignore the logarithmic factor in Theorem 2, since we will mostly be
interested in significant (i.e. polynomial) differences between the two algorithms, and it is quite possible
that this logarithmic factor is merely an artifact of our analysis. Using $m = nb$, we get that the bound for
the SGD algorithm is

$$\mathbb{E}[L(\bar{w})] - L(w^*) \le \tilde{O}\left(\sqrt{\frac{L(w^*)}{bn}} + \frac{1}{n}\right) = \tilde{O}\left(\sqrt{\frac{L(w^*)}{m}} + \frac{b}{m}\right), \qquad (5)$$

and the bound for the accelerated gradient method we propose is

$$\mathbb{E}[L(w_n^{ag})] - L(w^*) \le \tilde{O}\left(\sqrt{\frac{L(w^*)}{bn}} + \frac{1}{\sqrt{b}\, n} + \frac{1}{n^2}\right) = \tilde{O}\left(\sqrt{\frac{L(w^*)}{m}} + \frac{\sqrt{b}}{m} + \frac{b^2}{m^2}\right). \qquad (6)$$
To understand the implications of these bounds, we follow the approach described in [3, 12] to analyze large-scale
learning algorithms. First, we fix a desired suboptimality parameter $\epsilon$, which measures how close to $L(w^*)$
we want to get. Then, we assume that both algorithms are run until the suboptimality of their outputs is at
most $\epsilon$. Our goal is to understand the runtime each algorithm needs to attain suboptimality $\epsilon$,
as a function of $L(w^*)$, $\epsilon$, and $b$.

To measure this runtime, we need to distinguish between two settings: a parallel setting, where we assume that
the mini-batch gradient computations are performed in parallel, and a serial setting, where the gradient
computations are performed one after the other. In a parallel setting, we can take the number of iterations
$n$ as a rough measure of the runtime (note that in both algorithms, the runtime of a single iteration is
comparable). In a serial setting, the relevant parameter is $m$, the number of data accesses.

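To analyze the dependence on $m$ and $n$, we upper bound (5) and (6) by $\epsilon$, and invert them to get the bounds
on $m$ and $n$. Ignoring logarithmic factors, for the SGD algorithm we get

$$n \le \frac{1}{\epsilon}\left(\frac{L(w^*)}{\epsilon} \cdot \frac{1}{b} + 1\right), \qquad m \le \frac{1}{\epsilon}\left(\frac{L(w^*)}{\epsilon} + b\right), \qquad (7)$$

and for the AG algorithm we get

$$n \le \frac{1}{\epsilon}\left(\frac{L(w^*)}{\epsilon} \cdot \frac{1}{b} + \frac{1}{\sqrt{b}} + \sqrt{\epsilon}\right), \qquad m \le \frac{1}{\epsilon}\left(\frac{L(w^*)}{\epsilon} + \sqrt{b} + b\sqrt{\epsilon}\right). \qquad (8)$$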



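First, let us compare the performance of these two algorithms in the parallel setting, where the relevant
parameter to measure runtime is $n$. Analyzing which of the terms in each bound dominates, we get that for
the SGD algorithm, there are 2 regimes, while for the AG algorithm, there are 2-3 regimes depending on
the relationship between $L(w^*)$ and $\epsilon$. The following two tables summarize the situation (again, ignoring
constants):

SGD Algorithm

Regime                                                          | n
$b \le \sqrt{L(w^*) m}$                                         | $\frac{L(w^*)}{\epsilon^2 b}$
$b \ge \sqrt{L(w^*) m}$                                         | $\frac{1}{\epsilon}$

AG Algorithm

Regime                                                          | n
$\epsilon \le L(w^*)^2$, $b \le L(w^*)^{1/4} m^{3/4}$           | $\frac{L(w^*)}{\epsilon^2 b}$
$\epsilon \le L(w^*)^2$, $b \ge L(w^*)^{1/4} m^{3/4}$           | $\frac{1}{\sqrt{\epsilon}}$
$\epsilon \ge L(w^*)^2$, $b \le L(w^*) m$                       | $\frac{L(w^*)}{\epsilon^2 b}$
$\epsilon \ge L(w^*)^2$, $L(w^*) m \le b \le m^{2/3}$           | $\frac{1}{\epsilon \sqrt{b}}$
$\epsilon \ge L(w^*)^2$, $b \ge m^{2/3}$                        | $\frac{1}{\sqrt{\epsilon}}$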

From the tables, we see that for both methods, there is an initial linear speedup as a function of the minibatch
size $b$. However, in the AG algorithm, this linear speedup regime holds for much larger minibatch sizes¹. Even
beyond the linear speedup regime, the AG algorithm still maintains a $\sqrt{b}$ speedup, for the reasonable case
where $\epsilon \ge L(w^*)^2$. Finally, in all regimes, the runtime bound of the AG algorithm is equal to or significantly
smaller than that of the SGD algorithm.
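The parallel-runtime comparison can be made concrete by evaluating the iteration bounds numerically. The sketch below assumes the bounds read $n \le \frac{1}{\epsilon}(\frac{L(w^*)}{\epsilon b} + 1)$ for SGD and $n \le \frac{1}{\epsilon}(\frac{L(w^*)}{\epsilon b} + \frac{1}{\sqrt{b}} + \sqrt{\epsilon})$ for AG, with all constants and logarithmic factors dropped; the function names are hypothetical.

```python
import math


def sgd_iterations(L_star, eps, b):
    """Iteration bound for SGD, constants ignored."""
    return (L_star / (eps * b) + 1.0) / eps


def ag_iterations(L_star, eps, b):
    """Iteration bound for AG, constants and log factors ignored."""
    return (L_star / (eps * b) + 1.0 / math.sqrt(b) + math.sqrt(eps)) / eps


eps = 0.01
# Separable case L(w*) = 0: the SGD bound does not improve with b, the AG bound does.
sgd_separable = [sgd_iterations(0.0, eps, b) for b in (1, 100)]
ag_separable = [ag_iterations(0.0, eps, b) for b in (1, 100)]
# Noisy case: both bounds start in the linear-speedup regime.
sgd_noisy = [sgd_iterations(0.5, eps, b) for b in (1, 2)]
```

Since constants are ignored, only the way each bound changes with $b$ is meaningful here, not the absolute values.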

We now turn to discuss the serial setting, where the runtime is measured in terms of $m$. Inspecting (7) and
(8), we see that a larger size of $b$ actually requires $m$ to increase for both algorithms. This is to be expected,
since mini-batching does not lead to large gains in a serial setting. However, using mini-batching in a serial
setting might still be beneficial for implementation reasons, resulting in constant-factor improvements in
runtime (e.g. saving overhead and loop control, and via pipelining, concurrent memory accesses etc.). In
that case, we can at least ask what is the largest mini-batch size that won't degrade the runtime guarantee
by more than a constant. Using our bounds, the mini-batch size $b$ for the SGD algorithm can scale as much
as $L(w^*)/\epsilon$, vs. a larger value of $L(w^*)/\epsilon^{3/2}$ for the AG algorithm.
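A quick numeric check of these serial-setting thresholds is sketched below. It assumes the sample-size bounds $m \le \frac{1}{\epsilon}(\frac{L(w^*)}{\epsilon} + b)$ for SGD and $m \le \frac{1}{\epsilon}(\frac{L(w^*)}{\epsilon} + \sqrt{b} + b\sqrt{\epsilon})$ for AG, with constants ignored; the function names and the example values of $L(w^*)$ and $\epsilon$ are hypothetical.

```python
def sgd_samples(L_star, eps, b):
    """Serial cost bound (number of data accesses) for SGD, constants ignored."""
    return (L_star / eps + b) / eps


def ag_samples(L_star, eps, b):
    """Serial cost bound (number of data accesses) for AG, constants ignored."""
    return (L_star / eps + b ** 0.5 + b * eps ** 0.5) / eps


def largest_batch_sgd(L_star, eps):
    return L_star / eps


def largest_batch_ag(L_star, eps):
    return L_star / eps ** 1.5


L_star, eps = 0.5, 0.01
baseline = L_star / eps ** 2            # cost of the L(w*)/eps^2 term alone
b_sgd = largest_batch_sgd(L_star, eps)
b_ag = largest_batch_ag(L_star, eps)
sgd_ratio = sgd_samples(L_star, eps, b_sgd) / baseline
ag_ratio = ag_samples(L_star, eps, b_ag) / baseline
```

At these batch sizes each bound stays within a small constant factor of the baseline, while the batch size tolerated by AG is larger by a factor of $1/\sqrt{\epsilon}$.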

Finally, an interesting point is that the AG algorithm is sometimes actually necessary to obtain significant
speed-ups via a mini-batch framework (according to our bounds). Based on the table above, this happens
when the desired suboptimality $\epsilon$ is not much smaller than $L(w^*)$, i.e. $\epsilon = \Omega(L(w^*))$. This includes the
"separable" case, $L(w^*) = 0$, and in general a regime where the "estimation error" $\epsilon$ and "approximation
error" $L(w^*)$ are roughly the same, an arguably very relevant one in machine learning. For the SGD
algorithm, the critical mini-batch value $\sqrt{L(w^*) m}$ can be shown to equal $L(w^*)/\epsilon$, which is $O(1)$ in our
case. So with SGD we get no non-constant parallel speedup. However, with AG, we still enjoy a speedup of
at least $\Theta(\sqrt{b})$, all the way up to mini-batch size $b = m^{2/3}$.



5 Experiments 

¹ Since it is easily verified that $\sqrt{L(w^*) m}$ is generally smaller than both $L(w^*)^{1/4} m^{3/4}$ and $L(w^*) m$.

Figure 1: Left: Test smoothed hinge loss, as a function of $p$, after training using the AG algorithm on 6361 examples
from astro-physics, for various batch sizes. Right: the same, for 18578 examples from CCAT. In both datasets,
margin violations were removed before training so that $L(w^*) = 0$. The circled points are the theoretically-derived
values $p = \ln b / (2 \ln(n - 1))$ (see Theorem 2).

We implemented both the SGD algorithm (Algorithm 1) and the AG algorithm (Algorithm 2, using step-sizes
of the form $\gamma_i = \gamma i^p$ as suggested by Theorem 2) on two publicly-available binary classification problems,
astro-physics and CCAT. We used the smoothed hinge loss $\ell(w; x, y)$, defined as $0.5 - y w^\top x$ if $y w^\top x < 0$;
$0$ if $y w^\top x > 1$, and $0.5 (1 - y w^\top x)^2$ in between.
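The smoothed hinge loss used in the experiments, together with its gradient in $w$, can be written as a short Python sketch. The function names are hypothetical, and the margin $y\, w^\top x$ is computed with plain lists to keep the example dependency-free.

```python
def smoothed_hinge(margin):
    """Smoothed hinge loss as a function of the margin y * <w, x>."""
    if margin < 0:
        return 0.5 - margin
    if margin > 1:
        return 0.0
    return 0.5 * (1.0 - margin) ** 2


def smoothed_hinge_grad(w, x, y):
    """Gradient of the smoothed hinge loss with respect to w."""
    margin = y * sum(a * c for a, c in zip(w, x))
    if margin < 0:
        slope = -1.0                 # linear part
    elif margin > 1:
        slope = 0.0                  # flat part
    else:
        slope = -(1.0 - margin)      # quadratic part
    return [slope * y * c for c in x]


loss_values = [smoothed_hinge(u) for u in (-1.0, 0.0, 1.0, 2.0)]
grad_at_zero = smoothed_hinge_grad([0.0, 0.0], [1.0, 2.0], 1)
```

The three pieces meet continuously at margins 0 and 1, and the slope with respect to the margin changes at rate at most 1, so the loss is smooth in the sense required by the analysis.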

While both datasets are relatively easy to classify, we also wished to understand the algorithms' performance
in the "separable" case $L(w^*) = 0$, to see if the theory in Section 4 holds in practice. To this end, we created
an additional version of each dataset, where $L(w^*) = 0$, by training a classifier on the entire dataset and
removing margin violations.

In all of our experiments, we used up to half of the data for training, and one-quarter each for validation
and testing. The validation set was used to determine the step sizes $\eta$ and $\gamma_i$. We justify this by noting that
our goal is to compare the performance of the SGD and AG algorithms, independently of the difficulties in
choosing their stepsizes. In the implementation, we neglected the projection step, as we found it does not
significantly affect performance when the stepsizes are properly selected.

In our first set of experiments, we attempted to determine the relationship between the performance of
the AG algorithm and the $p$ parameter, which determines the rate of increase of the step sizes $\gamma_i$. Our
experiments are summarized in Figure 1. Perhaps the most important conclusion to draw from these plots is
that neither the "traditional" choice $p = 1$, nor the constant-step-size choice $p = 0$, gives the best performance
in all circumstances. Instead, there is a complicated data-dependent relationship between $p$ and the final
classifier's performance. Furthermore, there appears to be a weak trend towards higher $p$ performing better
for larger minibatch sizes $b$, which corresponds neatly with our theoretical predictions.

In our next experiment, we directly compared the performance of the SGD and AG methods. To do so, we
varied the minibatch size $b$ while holding the total amount of data used for training, $m = nb$, fixed. When
$L(w^*) > 0$ (top row of Figure 2), and the total sample size $m$ is high and the suboptimality $\epsilon$ is low (red and black
plots), we see that for small minibatch size, both methods do not degrade as we increase $b$, corresponding to
a linear parallel speedup. In fact, SGD is actually overall better, but as $b$ increases, its performance degrades
more quickly, eventually performing worse than AG. That is, even in the least favorable scenario for AG
(high $L(w^*)$ and small $\epsilon$, see the tables in Section 4), it does give benefits with large enough minibatch
sizes. Also, we see that even here, once the suboptimality $\epsilon$ is roughly equal to $L(w^*)$, AG significantly
outperforms SGD, even with small minibatches, agreeing with our theory.

Turning to the case $L(w^*) = 0$ (bottom two rows of Figure 2), which is theoretically more favorable to AG,
we see it is indeed mostly better, in terms of retaining linear parallel speedups for larger minibatch sizes,
even for large data set sizes corresponding to small suboptimality values, and might even be advantageous
with small minibatch sizes.

Figure 2: Test loss on astro-physics and CCAT as a function of mini-batch size $b$ (in log-scale), where the total
amount of training data $m = nb$ is held fixed. Solid lines and dashed lines are for SGD and AG respectively (for AG,
we used $p = \ln b / (2 \ln(n - 1))$ as in Theorem 2). The upper row shows the smoothed hinge loss on the test set, using
the original (uncensored) data. The bottom rows show the smoothed hinge loss and misclassification rate on the test
set, using the modified data where $L(w^*) = 0$. All curves are averaged over three runs.
6 Summary



In this paper, we presented novel contributions to the theory of first order stochastic convex optimization
(Theorems 1 and 2, generalizing results of [4] and [5] to be sensitive to $L(w^*)$), developed a novel step size
strategy for the accelerated method that we used in order to obtain our results and that we saw works well in
practice, and provided a more refined analysis of the effects of minibatching which paints a different picture
than previous analyses [4, 1] and highlights the benefit of accelerated methods.

A remaining open practical and theoretical question is whether the bound of Theorem 2 is tight. Following
[5], the bound is tight for $b = 1$ and $b \to \infty$, i.e. the first and third terms are tight, but it is not clear
whether the $1/(\sqrt{b}\, n)$ dependence is indeed necessary. It would be interesting to understand whether with a
more refined analysis, or perhaps different step-sizes, we can avoid this term, whether an altogether different
algorithm is needed, or whether this term does represent the optimal behavior for any method based on
$b$-aggregated stochastic gradient estimates.

References 

[1] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. Technical report, arXiv, 2011.

[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex
optimization. Operations Research Letters, 31(3):167-175, 2003.

[3] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, 2007.

[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using
mini-batches. Technical report, arXiv, 2010.

[5] G. Lan. An optimal method for stochastic convex optimization. Technical report, Georgia Institute of
Technology, 2009.

[6] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to
stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

[7] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence
O(1/k^2). Doklady AN SSSR, 269:543-547, 1983.

[8] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127-152, 2005.

[9] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming,
120(1):221-259, August 2009.

[10] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML,
2011.

[11] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver
for SVM. Math. Program., 127(1):3-30, 2011.

[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In ICML,
2008.

[13] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In NIPS, 2010.

[14] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, Hebrew
University of Jerusalem, 2007.

[15] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to
SIAM Journal on Optimization, 2008.

[16] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal
of Machine Learning Research, 11:2543-2596, 2010.



A Generalizing to Different Norms 



We now turn to general norms and discuss the generic Mirror Descent and Accelerated Mirror Descent
algorithms. In this more general case we let the domain $W$ be some closed convex subset of a Banach space
equipped with norm $\|\cdot\|$. We will use $\|\cdot\|_*$ to represent the dual norm of $\|\cdot\|$. Further, the $H$-smoothness of
the loss function in this general case takes the form that for any $z \in Z$ and any $w, w' \in W$,

$$\|\nabla \ell(w, z) - \nabla \ell(w', z)\|_* \le H \|w - w'\|.$$

The key to generalizing the algorithms and results is to find a non-negative function $R : W \to \mathbb{R}$ that is
strongly convex on the domain $W$ w.r.t. the norm $\|\cdot\|$, that is:

Definition 1. A function $R : W \to \mathbb{R}$ is said to be 1-strongly convex w.r.t. norm $\|\cdot\|$ if for any $w, w' \in W$
and any $\alpha \in [0, 1]$,

$$R(\alpha w + (1 - \alpha) w') \le \alpha R(w) + (1 - \alpha) R(w') - \frac{\alpha (1 - \alpha)}{2} \|w - w'\|^2.$$

We also denote more generally

$$D := \sqrt{2 \sup_{w \in W} R(w)}.$$

The generalizations of the SGD and AG methods are summarized in Algorithms 3 and 4 respectively. The
key difference between these and the Euclidean case is that the gradient descent step is replaced by a descent
step involving gradient mappings of $R$ and its conjugate $R^*$, and the projection step is replaced by a Bregman
projection (i.e., projection onto the point of the set $W$ that minimizes the Bregman divergence to the given point).
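As a sanity check on this generalization, the Euclidean algorithms are recovered by one standard choice of $R$. The short derivation below assumes $R(w) = \frac{1}{2}\|w\|_2^2$ on a Euclidean space and uses the standard definition of the Bregman divergence, $\Delta_R(w \mid w') = R(w) - R(w') - \langle \nabla R(w'), w - w' \rangle$, which is stated here as an assumption since it is not spelled out in the text.

```latex
\begin{align*}
R(w) &= \tfrac{1}{2}\|w\|_2^2, & \nabla R(w) &= w, \\
R^*(v) &= \tfrac{1}{2}\|v\|_2^2, & \nabla R^*(v) &= v, \\
\Delta_R(w \mid w') &= \tfrac{1}{2}\|w\|_2^2 - \tfrac{1}{2}\|w'\|_2^2 - \langle w', w - w' \rangle
  = \tfrac{1}{2}\|w - w'\|_2^2 . &&
\end{align*}
% Descent step:       \nabla R^*\bigl(\nabla R(w_i) - \eta \nabla \ell_i(w_i)\bigr) = w_i - \eta \nabla \ell_i(w_i)
% Bregman projection: \arg\min_{w \in W} \tfrac{1}{2}\|w - w'_{i+1}\|_2^2 = P_W(w'_{i+1})
```

With this choice the descent step is the plain gradient step and the Bregman projection is the Euclidean projection, so the mirror descent variants reduce to the Euclidean updates.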

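Algorithm 3 Stochastic Mirror Descent with Mini-Batching (SMD)
Parameters: Step size $\eta$, mini-batch size $b$.
Input: Sample $z_1, \ldots, z_m$
$w_1 = \arg\min_{w} R(w)$
for $i = 1$ to $n = m/b$ do
    Let $\ell_i(w_i) = \frac{1}{b} \sum_{t=b(i-1)+1}^{bi} \ell(w_i, z_t)$
    $w'_{i+1} := \nabla R^*\left(\nabla R(w_i) - \eta \nabla \ell_i(w_i)\right)$
    $w_{i+1} := \arg\min_{w \in W} \Delta_R(w \mid w'_{i+1})$    (equivalently, $w_{i+1} = \arg\min_{w \in W} \{\eta \langle \nabla \ell_i(w_i), w - w_i \rangle + \Delta_R(w \mid w_i)\}$)
end for
Return $\bar{w} = \frac{1}{n} \sum_{i=1}^{n} w_i$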

Theorem 3. Let $R : W \to \mathbb{R}$ be a non-negative strongly convex function on $W$ w.r.t. norm $\|\cdot\|$. Let
$K = \sqrt{2 \sup_{w : \|w\| \le 1} R(w)}$. For any $w^* \in W$, using Stochastic Mirror Descent with a step size of



r] = mm < 



U + Y L(w*)f,n 



we have that, 



$$\mathbb{E}[L(\bar{w})] - L(w^*) \le \sqrt{\frac{128 H K^2 R(w^*) L(w^*)}{bn}} + \frac{4 L(w^*) + 8 H R(w^*)}{n} + \frac{16 H K^2 R(w^*)}{bn}.$$
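Algorithm 4 Accelerated Mirror Descent Method (AMD)
Parameters: Step sizes $(\gamma_i, \beta_i)$, mini-batch size $b$
Input: Sample $z_1, \ldots, z_m$
$w_1 = \arg\min_{w} R(w)$
for $i = 1$ to $n = m/b$ do
    Let $\ell_i(w) := \frac{1}{b} \sum_{t=b(i-1)+1}^{bi} \ell(w, z_t)$
    $w_i^{md} := \beta_i^{-1} w_i + (1 - \beta_i^{-1}) w_i^{ag}$
    $w'_{i+1} := \nabla R^*\left(\nabla R(w_i^{md}) - \gamma_i \nabla \ell_i(w_i^{md})\right)$
    $w_{i+1} := \arg\min_{w \in W} \Delta_R(w \mid w'_{i+1})$    (equivalently, $w_{i+1} = \arg\min_{w \in W} \{\gamma_i \langle \nabla \ell_i(w_i^{md}), w - w_i^{md} \rangle + \Delta_R(w \mid w_i^{md})\}$)
    $w_{i+1}^{ag} \leftarrow \beta_i^{-1} w_{i+1} + (1 - \beta_i^{-1}) w_i^{ag}$
end for
Return $w_n^{ag}$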



Theorem 4. Let $R : W \to \mathbb{R}$ be a non-negative strongly convex function on $W$ w.r.t. norm $\|\cdot\|$. Also let
$K = \sqrt{2 \sup_{w : \|w\| \le 1} R(w)}$. For any $w^* \in W$, using Accelerated Mirror Descent with step size parameters
$\beta_i = \frac{i+1}{2}$, $\gamma_i = \gamma i^p$ where

_ . f 1 / &i?(w*) / b 6R{w*) \3?tt| 

'^~™'''|4tf' Y 174ifif2L(w*)(n-l)2p+i' 1044friv:2(n - l)2p j yi?D2+L(w*)y J ""^ 

$$p = \min\left\{\max\left\{\frac{\log(b)}{2\log(n-1)}, \frac{\log\log(n)}{2\left(\log(b(n-1)) - \log\log(n)\right)}\right\}, 1\right\}$$



as long as n> max{783-K'2, ^"^^ we have that : 



E [L(wj^8)] - L(w'^) < 164 HK^R{^*)L{v^*) ^ hSQHK^{R{w^)f/^D-^ ^ 5A5H ./]^) ^ SHR{w* 



b{n-l) Vb{n-1) b{n-l) (n - 1)2 

B Complete Proofs 

We provide complete proofs of Theorems 3 and 4, noting how Theorems 1 and 2 are specializations to the 
Euclidean case. 

B.1 Stochastic Mirror Descent

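Proof of Theorem 3. Due to $H$-smoothness of the convex function $L$ we have that

$$L(w_{i+1}) \le L(w_i) + \langle \nabla L(w_i), w_{i+1} - w_i \rangle + \frac{H}{2} \|w_{i+1} - w_i\|^2$$

$$= L(w_i) + \langle \nabla L(w_i) - \nabla \ell_i(w_i), w_{i+1} - w_i \rangle + \frac{H}{2} \|w_{i+1} - w_i\|^2 + \langle \nabla \ell_i(w_i), w_{i+1} - w_i \rangle$$

by Hölder's inequality we get

$$\le L(w_i) + \|\nabla L(w_i) - \nabla \ell_i(w_i)\|_* \|w_{i+1} - w_i\| + \frac{H}{2} \|w_{i+1} - w_i\|^2 + \langle \nabla \ell_i(w_i), w_{i+1} - w_i \rangle$$

since for any $\alpha > 0$, $ab \le \frac{a^2}{2\alpha} + \frac{\alpha b^2}{2}$ (applied here with $\alpha = 1/\eta - H$),

$$\le L(w_i) + \frac{\|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2}{2(1/\eta - H)} + \frac{1/\eta - H}{2} \|w_{i+1} - w_i\|^2 + \frac{H}{2} \|w_{i+1} - w_i\|^2 + \langle \nabla \ell_i(w_i), w_{i+1} - w_i \rangle.$$

We now note that the update step can be written equivalently as

$$w_{i+1} = \arg\min_{w \in W} \{\eta \langle \nabla \ell_i(w_i), w - w_i \rangle + \Delta_R(w, w_i)\}.$$

It can be shown that (see for instance Lemma 1 of [5])

$$\eta \langle \nabla \ell_i(w_i), w_{i+1} - w_i \rangle \le \eta \langle \nabla \ell_i(w_i), w^* - w_i \rangle + \Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1}) - \Delta_R(w_i, w_{i+1}).$$

Plugging this in we get that

$$L(w_{i+1}) \le L(w_i) + \frac{\|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2}{2(1/\eta - H)} + \frac{\|w_{i+1} - w_i\|^2}{2\eta} + \langle \nabla \ell_i(w_i), w^* - w_i \rangle + \frac{1}{\eta}\left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1}) - \Delta_R(w_i, w_{i+1})\right)$$

$$= L(w_i) + \frac{\|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2}{2(1/\eta - H)} + \frac{\|w_{i+1} - w_i\|^2}{2\eta} + \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle + \langle \nabla L(w_i), w^* - w_i \rangle + \frac{1}{\eta}\left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1}) - \Delta_R(w_i, w_{i+1})\right)$$

$$= L(w_i) + \frac{\|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2}{2(1/\eta - H)} + \frac{\|w_{i+1} - w_i\|^2}{2\eta} + \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle - \langle \nabla L(w_i), w_i - w^* \rangle + \frac{1}{\eta}\left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1}) - \Delta_R(w_i, w_{i+1})\right)$$

by strong convexity, $\Delta_R(w_i, w_{i+1}) \ge \frac{1}{2}\|w_{i+1} - w_i\|^2$ and so,

$$\le L(w_i) + \frac{\|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2}{2(1/\eta - H)} + \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle - \langle \nabla L(w_i), w_i - w^* \rangle + \frac{1}{\eta}\left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1})\right)$$

since $\eta \le \frac{1}{2H}$,

$$\le L(w_i) + \eta \|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2 + \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle - \langle \nabla L(w_i), w_i - w^* \rangle + \frac{1}{\eta}\left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1})\right)$$

by convexity, $L(w_i) - \langle \nabla L(w_i), w_i - w^* \rangle \le L(w^*)$ and so

$$\le L(w^*) + \eta \|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2 + \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle + \frac{1}{\eta}\left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1})\right).$$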
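Hence we conclude that:

$$\frac{1}{n-1} \sum_{i=1}^{n-1} L(w_{i+1}) - L(w^*) \le \frac{\eta}{n-1} \sum_{i=1}^{n-1} \|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2 + \frac{1}{n-1} \sum_{i=1}^{n-1} \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle + \frac{1}{\eta (n-1)} \sum_{i=1}^{n-1} \left(\Delta_R(w^*, w_i) - \Delta_R(w^*, w_{i+1})\right)$$

$$\le \frac{\eta}{n-1} \sum_{i=1}^{n-1} \|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2 + \frac{1}{n-1} \sum_{i=1}^{n-1} \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle + \frac{\Delta_R(w^* \mid w_1) - \Delta_R(w^* \mid w_n)}{\eta (n-1)}$$

$$\le \frac{\eta}{n-1} \sum_{i=1}^{n-1} \|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2 + \frac{1}{n-1} \sum_{i=1}^{n-1} \langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle + \frac{R(w^*)}{\eta (n-1)}.$$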

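Taking expectation with respect to the sample on both sides and noticing that
$\mathbb{E}[\langle \nabla \ell_i(w_i) - \nabla L(w_i), w^* - w_i \rangle] = 0$, we get that

$$\mathbb{E}\left[\frac{1}{n-1} \sum_{i=1}^{n-1} L(w_{i+1})\right] - L(w^*) \le \frac{\eta}{n-1} \sum_{i=1}^{n-1} \mathbb{E}\left[\|\nabla L(w_i) - \nabla \ell_i(w_i)\|_*^2\right] + \frac{R(w^*)}{\eta (n-1)}.$$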



Now note that

$$\nabla L(w_i) - \nabla \ell_i(w_i) = \frac{1}{b} \sum_{t=(i-1)b+1}^{ib} \left(\nabla L(w_i) - \nabla \ell(w_i, z_t)\right)$$

and that $(\nabla L(w_i) - \nabla \ell(w_i, z_t))$ is a mean zero vector drawn i.i.d. Also note that $w_i$ only depends on the first
$(i-1)b$ examples and so when we consider expectation w.r.t. $z_{(i-1)b+1}, \ldots, z_{ib}$ alone, $w_i$ is fixed. Hence by
Corollary B.2 we have that



E 



\VL{wi)-Wei{wi)\\: 



Y iS/L{wi)-We{wi,Zt)) 

t={i-l)b+l 
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Y E[|l(VL(w,)-V£(wi,zt))||f 

t=(i-l)6+l 
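To illustrate the mini-batch variance bound just used, here is a small exact computation. It is only a sketch under assumed conditions: the Euclidean norm (so $K = 1$), batch size $b = 3$, and a hypothetical zero-mean distribution uniform over four points in the plane. In this special case the bound holds with equality, because the cross terms vanish by independence.

```python
from itertools import product

# Hypothetical zero-mean distribution: uniform over four points in R^2.
support = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
b = 3  # mini-batch size (assumed for illustration)

def sq_norm(v):
    return v[0] ** 2 + v[1] ** 2

# Left-hand side: E || (1/b) * sum_t x_t ||^2, by enumerating all batches.
lhs = 0.0
for batch in product(support, repeat=b):
    mean = (sum(x[0] for x in batch) / b, sum(x[1] for x in batch) / b)
    lhs += sq_norm(mean)
lhs /= len(support) ** b

# Right-hand side with K = 1: (1 / b^2) * sum_t E ||x_t||^2.
second_moment = sum(sq_norm(x) for x in support) / len(support)
rhs = (1.0 / b ** 2) * b * second_moment
```

The factor $1/b$ gained here is what lets a mini-batch of size $b$ reduce the gradient-noise term in the bounds that follow.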



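Plugging this back we get that

\[
\begin{aligned}
\mathbb{E}\left[ \frac{1}{n-1}\sum_{i=1}^{n-1} L(w_{i+1}) \right] - L(w^*)
&\le \frac{K^2 \eta}{b^2 (n-1)}\sum_{i=1}^{n-1}\sum_{t=(i-1)b+1}^{ib} \mathbb{E}\left[ \|\nabla L(w_i) - \nabla \ell(w_i, z_t)\|_*^2 \right] + \frac{R(w^*)}{\eta(n-1)} \\
&\le \frac{2 K^2 \eta}{b^2 (n-1)}\sum_{i=1}^{n-1}\sum_{t=(i-1)b+1}^{ib} \mathbb{E}\left[ \|\nabla L(w_i)\|_*^2 + \|\nabla \ell(w_i, z_t)\|_*^2 \right] + \frac{R(w^*)}{\eta(n-1)} .
\end{aligned}
\]

For any non-negative $H$-smooth convex function $f$, we have the self-bounding property that $\|\nabla f(w)\|_* \le \sqrt{4 H f(w)}$. Using this,

\[
\begin{aligned}
&\le \frac{8 H K^2 \eta}{b^2 (n-1)}\sum_{i=1}^{n-1}\sum_{t=(i-1)b+1}^{ib} \mathbb{E}\left[ L(w_i) + \ell(w_i, z_t) \right] + \frac{R(w^*)}{\eta(n-1)} \\
&= \frac{16 H K^2 \eta}{b}\, \mathbb{E}\left[ \frac{1}{n-1}\sum_{i=1}^{n-1} L(w_i) \right] + \frac{R(w^*)}{\eta(n-1)} .
\end{aligned}
\]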
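The self-bounding property can be checked on a concrete loss. The sketch below uses the scalar logistic function $f(w) = \log(1 + e^w)$, which is non-negative, convex and $H$-smooth with $H = 1/4$. This choice of $f$ and the evaluation grid are assumptions made for illustration only.

```python
import math

def logistic(w):
    """f(w) = log(1 + exp(w)): non-negative, convex, H-smooth with H = 1/4."""
    return math.log1p(math.exp(w))

def logistic_grad(w):
    """Derivative of the logistic function above (the sigmoid)."""
    return 1.0 / (1.0 + math.exp(-w))

H = 0.25
grid = [k / 10.0 for k in range(-200, 201)]  # w in [-20, 20]
# Slack 4*H*f(w) - f'(w)^2 should be non-negative everywhere on the grid.
min_slack = min(4.0 * H * logistic(w) - logistic_grad(w) ** 2 for w in grid)
```

The slack is smallest for very negative $w$, where both the loss and its gradient are close to zero.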



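Adding $\frac{L(w_1)}{n-1}$ on both sides and removing $\frac{L(w_n)}{n-1}$ on the left we conclude that

\[
\mathbb{E}\left[ \frac{1}{n-1}\sum_{i=1}^{n-1} L(w_i) \right] - L(w^*) \le \frac{16 H K^2 \eta}{b}\, \mathbb{E}\left[ \frac{1}{n-1}\sum_{i=1}^{n-1} L(w_i) \right] + \frac{R(w^*)}{\eta(n-1)} + \frac{L(w_1)}{n-1} .
\]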



Hence we conclude that 



E 



n 



i=l 



i(w*) < 



7 -t^lW ) H 



, 6 



1 L(w*) 



1 /^i.(wi) _^ E(w* 



6 



?7n 



L(wi) 
n 



+ 



6 16HK^R{w*) 



bn 



Writing a = ^ ^^U^^ - 1, so that r] = (l - ^) we get, 

■ n 



E 



n 



L{w ) < aL{w ) + — — - + - 



< aL{w*) + ^ ^ ' + I a + 



n 



a bn 
1\ 32HR{w*) 



a J bn 



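Now we shall always pick $\eta \le \frac{b}{32 H K^2}$ so that $\alpha \le 1$, and so

\[
\mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^{n} L(w_i) \right] - L(w^*) \le \alpha L(w^*) + \frac{32 H K^2 R(w^*)}{\alpha b n} + \frac{2 L(w_1)}{n} + \frac{16 H K^2 R(w^*)}{b n} .
\]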



Picking 



rj = mm 



1 



2H' 32HK^' 



L(w*)HK^n 



16(1 + ^^1^) j ' 



or equivalently a = min |l, ^ ^^"I^^I^i^ ^^ we get, 



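\[
\mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^{n} L(w_i) \right] - L(w^*) \le \sqrt{\frac{128 H K^2 R(w^*) L(w^*)}{b n}} + \frac{2 L(w_1)}{n} + \frac{16 H K^2 R(w^*)}{b n} .
\]

Finally note that by smoothness,

\[
\begin{aligned}
L(w_1) &\le L(w^*) + \langle \nabla L(w_1) - \nabla L(w^*),\, w_1 - w^* \rangle + \langle \nabla L(w^*),\, w_1 - w^* \rangle \\
&\le L(w^*) + \|\nabla L(w_1) - \nabla L(w^*)\|_* \|w_1 - w^*\| + \|\nabla L(w^*)\|_* \|w_1 - w^*\| \\
&\le L(w^*) + H \|w_1 - w^*\|^2 + \sqrt{4 H L(w^*)}\, \|w_1 - w^*\| .
\end{aligned}
\]

Since $R$ is 1-strongly convex and $w_1 = \operatorname*{argmin}_{w} R(w)$,

\[
\begin{aligned}
L(w_1) &\le L(w^*) + 2 H R(w^*) + \sqrt{8 H L(w^*) R(w^*)} \\
&\le 2 L(w^*) + 4 H R(w^*) .
\end{aligned}
\]

Hence we conclude that

\[
\mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^{n} L(w_i) \right] - L(w^*) \le \sqrt{\frac{128 H K^2 R(w^*) L(w^*)}{b n}} + \frac{4 L(w^*) + 8 H R(w^*)}{n} + \frac{16 H K^2 R(w^*)}{b n} .
\]

Using Jensen's inequality concludes the proof. □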
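For readers who want to see the analyzed procedure in executable form, the following is a minimal sketch of mini-batch SGD with iterate averaging in the Euclidean, unconstrained case. The one-dimensional squared loss, the noiseless data $y = 2x$, the batch size and the step size are all assumptions chosen for illustration; they are not the paper's experimental setup.

```python
import random

def minibatch_sgd(data, b, eta, w_init=0.0):
    """Mini-batch SGD for the 1-D squared loss l(w, (x, y)) = 0.5 * (w*x - y)^2.

    Returns the starting point followed by one iterate per consumed mini-batch.
    """
    w = w_init
    iterates = [w]
    for start in range(0, len(data) - b + 1, b):
        batch = data[start:start + b]
        # Average gradient over the mini-batch.
        grad = sum((w * x - y) * x for x, y in batch) / b
        w = w - eta * grad
        iterates.append(w)
    return iterates

rng = random.Random(0)
# Hypothetical noiseless data y = 2x with x in [0.5, 1.5].
data = [(x, 2.0 * x) for x in (rng.uniform(0.5, 1.5) for _ in range(400))]
iterates = minibatch_sgd(data, b=4, eta=0.2)
w_last = iterates[-1]
w_avg = sum(iterates) / len(iterates)  # averaged iterate, as in the bound above
```

The theorem bounds the loss of the averaged iterate, which is why `w_avg` is reported alongside the last iterate.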

Proof of Theorem 1. For the Euclidean case, $R(w) = \frac{1}{2}\|w\|_2^2$ and $K = \sqrt{\sup_{w : \|w\|_2 \le 1} \|w\|_2^2} = 1$. Plugging these into the previous theorem concludes the proof. □

B.2 Accelerated Mirror Descent 

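Lemma B.1. For the accelerated update rule, if the step sizes $\beta_i \in [1, \infty)$ and $\gamma_i \in (0, \infty)$ are chosen such that $\beta_1 = 1$ and for all $i \in [n]$

\[
0 < \gamma_{i+1}(\beta_{i+1} - 1) \le \beta_i \gamma_i \quad \text{and} \quad 2 H \gamma_i \le \beta_i
\]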

then we have that 

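Proof. First note that for any $i$,

\[
\begin{aligned}
w_{i+1}^{ag} - w_i^{md} &= \beta_i^{-1} w_{i+1} + (1 - \beta_i^{-1}) w_i^{ag} - w_i^{md} \\
&= \beta_i^{-1} w_{i+1} + (1 - \beta_i^{-1}) w_i^{ag} - \beta_i^{-1} w_i - (1 - \beta_i^{-1}) w_i^{ag} \\
&= \beta_i^{-1} (w_{i+1} - w_i) \qquad (9)
\end{aligned}
\]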
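Identity (9) is purely algebraic and can be confirmed by running the three-sequence update. The sketch below assumes the unconstrained Euclidean case (so the prox step is a plain gradient step), a hypothetical quadratic objective, and the step-size schedule $\beta_i = (i+1)/2$, $\gamma_i = \gamma i^p$ with $\gamma = 1/4$ and $p = 1/2$; none of these specifics are fixed by the passage above.

```python
def accelerated_step(w, w_ag, grad_fn, beta, gamma):
    """One accelerated update in the unconstrained Euclidean case.

    w_md      = beta^{-1} * w + (1 - beta^{-1}) * w_ag
    w_next    = w - gamma * grad(w_md)          (argmin of the prox step)
    w_ag_next = beta^{-1} * w_next + (1 - beta^{-1}) * w_ag
    """
    inv = 1.0 / beta
    w_md = [inv * a + (1.0 - inv) * c for a, c in zip(w, w_ag)]
    g = grad_fn(w_md)
    w_next = [a - gamma * gi for a, gi in zip(w, g)]
    w_ag_next = [inv * a + (1.0 - inv) * c for a, c in zip(w_next, w_ag)]
    return w_md, w_next, w_ag_next

# Hypothetical objective L(w) = 0.5 * ||w - target||^2, so grad L(w) = w - target.
target = [1.0, -2.0, 0.5]
grad_fn = lambda v: [a - t for a, t in zip(v, target)]

w, w_ag = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
max_residual = 0.0
for i in range(1, 51):
    beta, gamma = (i + 1) / 2.0, 0.25 * i ** 0.5  # assumed step sizes
    w_md, w_next, w_ag_next = accelerated_step(w, w_ag, grad_fn, beta, gamma)
    # Residual of identity (9): w_ag_next - w_md - (w_next - w) / beta.
    for a_next, m, wn, wo in zip(w_ag_next, w_md, w_next, w):
        max_residual = max(max_residual, abs(a_next - m - (wn - wo) / beta))
    w, w_ag = w_next, w_ag_next
```

The residual stays at rounding level regardless of the gradient used, since (9) follows from the definitions of $w_i^{md}$ and $w_{i+1}^{ag}$ alone.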

Now by smoothness we have that 
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since wtl, = Ariw,+i + (1 - ^^l)w,^^ 

= L(w-) + (1 - /3r^)(vL(w-), w- - W-) + (vi-(wr-^),w.^i-wr'^) ^ iiw. - w.+,ir 



2 



-||Wi+i -Wi 



Pi ^Pili 



2 ||Wj+i-Wj| 



- (1 - /3, )i(w, ) + + -^^^ 2;s^ii^^+^ - ^^11 



^ ft 

^ ft 



by Holder's inequality, 



^Pi '^Pili Pi 



ft 



since for any a, h and a > 0, a6 < + ^ 



+ 



2(A/7i - H) 20a^ 



ft 

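We now note that the update step 2 of accelerated gradient can be written equivalently as

\[
w_{i+1} = \operatorname*{argmin}_{w \in \mathcal{W}} \left\{ \gamma_i \langle \nabla \ell_i(w_i^{md}),\, w - w_i^{md} \rangle + \Delta_R(w | w_i) \right\} .
\]

It can be shown that (see for instance Lemma 1 of [5])

\[
\gamma_i \langle \nabla \ell_i(w_i^{md}),\, w_{i+1} - w_i^{md} \rangle \le \gamma_i \langle \nabla \ell_i(w_i^{md}),\, w^* - w_i^{md} \rangle + \Delta_R(w^* | w_i) - \Delta_R(w^* | w_{i+1}) - \Delta_R(w_i | w_{i+1})
\]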
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Plugging this we get that, 



i(w,^i) < (1 - /3, )i(w,s) + or«./.,._ m + ^Tr:;^ + 



2(A/7i - ^) 2A7i 
, ^(wr^) + (V^(wr'^) - V^»(wr^), w, - wf'd) ^ Afl (w*|w,) - Afi (w-|w,+i) - (w,|w,+i) 
A Tift 

by strong-convexity of -R, Ajj (wi|wj+i) > | ||wj — Wj+i||^ and so, 

- n /^-lurw-g^ ^ i|vL(wr'^)-v^.(wrd)ii; (v^.(wr'),w--wr'^) 

-(1-^, )i:(w, )+ 2(ft/7i - if) + ft 

, L(wrd) + (VL(wfd)-V£,(w^md)^^._^md^ ^ Afl (w* K) - Afl (w* | W,+l ) 

ft Tift 

)i(w, )+ 2(/3./7i - if) + ft 

^ (VL(wr^)-Vf,(wr'^),w,-w,'"^) ^ Aj,(w*|w,)-A^(w*|w,+i) 

ft Tift 
L(wf'*) + (VL(wf"i),w* - wf"i) 

^ ft 
by convexity, i(w*) > L(w^"'^) + (VL(wf"i),w* - wj"'*), hence 

R-^^T( ag, , ||Vi^(wr'^)-V^i(w-°'i)||^^ ^ (V^.(w^'d)_v^(^md)^^._^md^ 

<(l-ft )i(w, )+ 2(ft/7i - H) + A 

^ (VL(wfd)-V^,(w,"-d-)^^._^md^ ^ Afl(w*|w,)-AH(w*|w,+i) ^ L(w-) 

ft Tift ft 

-n «-i^rr ag, , ||Vi^(wr'^)-V^.(wr'^)||' , (V£(wr-^)-V£,(wf'd),w,-w-) 
)i(w, )+ 2(A/T. - ff) + ft 

, Afl (w*|wj)- Afl (w*|w,+i) 1 
+ ^^ft "^^"^ 



+ 



2(A/Ti -H) ' A 

Ah (w*|wi) - Ai{ (w'^lwj+i) 



Tift 

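Thus we conclude that

\[
\begin{aligned}
L(w_{i+1}^{ag}) - L(w^*) &\le (1 - \beta_i^{-1}) \left( L(w_i^{ag}) - L(w^*) \right) + \frac{\|\nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md})\|_*^2}{2(\beta_i/\gamma_i - H)} \\
&\quad + \frac{\langle \nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md}),\, w_i - w^* \rangle}{\beta_i} + \frac{\Delta_R(w^* | w_i) - \Delta_R(w^* | w_{i+1})}{\beta_i \gamma_i} .
\end{aligned}
\]

Multiplying throughout by $\beta_i \gamma_i$ we get

\[
\begin{aligned}
\gamma_i \beta_i \left( L(w_{i+1}^{ag}) - L(w^*) \right) &\le \gamma_i (\beta_i - 1) \left( L(w_i^{ag}) - L(w^*) \right) + \frac{\gamma_i \beta_i \|\nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md})\|_*^2}{2(\beta_i/\gamma_i - H)} \\
&\quad + \Delta_R(w^* | w_i) - \Delta_R(w^* | w_{i+1}) + \gamma_i \langle \nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md}),\, w_i - w^* \rangle .
\end{aligned}
\]

Owing to the condition that $\gamma_{i+1}(\beta_{i+1} - 1) \le \gamma_i \beta_i$ we have that

\[
\begin{aligned}
\gamma_{i+1} (\beta_{i+1} - 1) \left( L(w_{i+1}^{ag}) - L(w^*) \right) &\le \gamma_i (\beta_i - 1) \left( L(w_i^{ag}) - L(w^*) \right) + \frac{\gamma_i \beta_i \|\nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md})\|_*^2}{2(\beta_i/\gamma_i - H)} \\
&\quad + \Delta_R(w^* | w_i) - \Delta_R(w^* | w_{i+1}) + \gamma_i \langle \nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md}),\, w_i - w^* \rangle .
\end{aligned}
\]

Using the above inequality repeatedly we conclude that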

7„(/3„ - 1) (L(w^^) - i(w*)) < 71 (A - 1) (i(wt^) - i(w*)) + 5^ " 2(A/7,-g) ' 

+ Ar (w*|wi) - Ah (w*|w„) + ^ 7i(Vi(w™'^) - V^iCw-'^), - w*) 

i=i 

< ..(ft - 1) (L(wf ) - L(w')) + g 

+ i?(w*) + ^7i(VL(wr^) - V£i(w-<i),Wi - w*) 

i=l 

= 71 - 1) (L(wi*^) - L(w )) + 2^ 



2(ft/7i - if) 



i=l 



since 2H'-fi < /3j, 



R{w*) 



< 71 (A - 1) (i(wf ) - i^(w*)) + E II Vi(wr'^) - V£,(w: 

ra-1 

+ E 7^(Vi(wr^) - V£,(wf<i), w, - w*) 

i=l 

n-1 

< 71 (A - l)i(wf ) + E 27f l|VL(wf ) - Vf,(wf )||f + i?(w*) 

i=l 

n-1 

+ E 7i<Vi(wr'i) - V£i(wfd), - w*) 

i=l 
n-1 

+ E 27? ||Vi(wrd) - V^i(wr<^) - VL(w,^^) + 

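Taking expectation we get that

\[
\begin{aligned}
\gamma_n (\beta_n - 1) \left( \mathbb{E}\left[ L(w_n^{ag}) \right] - L(w^*) \right) &\le \gamma_1 (\beta_1 - 1) L(w_1^{ag}) + \sum_{i=1}^{n-1} 2 \gamma_i^2\, \mathbb{E}\left[ \|\nabla L(w_i^{ag}) - \nabla \ell_i(w_i^{ag})\|_*^2 \right] + R(w^*) \\
&\quad + \sum_{i=1}^{n-1} 2 \gamma_i^2\, \mathbb{E}\left[ \|\nabla L(w_i^{md}) - \nabla \ell_i(w_i^{md}) - \nabla L(w_i^{ag}) + \nabla \ell_i(w_i^{ag})\|_*^2 \right] \qquad (10)
\end{aligned}
\]

Now note that

\[
\nabla L(w_i^{ag}) - \nabla \ell_i(w_i^{ag}) = \frac{1}{b}\sum_{t=(i-1)b+1}^{ib} \left( \nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t) \right)
\]

and

\[
\nabla L(w_i^{ag}) - \nabla \ell_i(w_i^{ag}) - \nabla L(w_i^{md}) + \nabla \ell_i(w_i^{md}) = \frac{1}{b}\sum_{t=(i-1)b+1}^{ib} \left( \nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t) - \nabla L(w_i^{md}) + \nabla \ell(w_i^{md}, z_t) \right) .
\]

Further, $\left( \nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t) \right)$ and $\left( \nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t) - \nabla L(w_i^{md}) + \nabla \ell(w_i^{md}, z_t) \right)$ are mean zero vectors drawn i.i.d. Also note that $w_i^{ag}$ (and likewise $w_i^{md}$) only depends on the first $(i-1)b$ examples and so when we consider expectation w.r.t. $z_{(i-1)b+1}, \ldots, z_{ib}$, it is fixed. Hence by Lemma B.2 we have that,

\[
\mathbb{E}\left[ \|\nabla L(w_i^{ag}) - \nabla \ell_i(w_i^{ag})\|_*^2 \right]
= \mathbb{E}\left[ \Big\| \frac{1}{b}\sum_{t=(i-1)b+1}^{ib} \left( \nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t) \right) \Big\|_*^2 \right]
\le \frac{K^2}{b^2}\sum_{t=(i-1)b+1}^{ib} \mathbb{E}\left[ \|\nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t)\|_*^2 \right]
\]

and similarly

\[
\mathbb{E}\left[ \|\nabla L(w_i^{ag}) - \nabla \ell_i(w_i^{ag}) - \nabla L(w_i^{md}) + \nabla \ell_i(w_i^{md})\|_*^2 \right]
\le \frac{K^2}{b^2}\sum_{t=(i-1)b+1}^{ib} \mathbb{E}\left[ \|\nabla L(w_i^{ag}) - \nabla \ell(w_i^{ag}, z_t) - \nabla L(w_i^{md}) + \nabla \ell(w_i^{md}, z_t)\|_*^2 \right] .
\]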

Plugging these back in Equation 10 we get : 
7„(/3„ - 1) (E [L(w^«)] - L(w*)) < 71 (/3i - l)i(wt«) + ^ 
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E [ii(vL(wf«) - v£(wr^^,))ll'] + Ri^*) 



t={i-l)6 + l 



VL(wf ) - V£(wf , ^0 - VL(wr'') + V£(wr, ^t) 



t=(i+l)6+l 

1-1 . ^^2 2 



<7i(/3i-l)Mwi^) + ^^^ ^ E[||VL(wr)||J + ||V£(wf,zt)||^]+il(w*) 



^E"^ E ^ 



t=(i+l)6+l 



t=(i-l)fe+l 



For any non-negative $H$-smooth convex function $f$, we have the self-bounding property that $\|\nabla f(w)\|_* \le \sqrt{4 H f(w)}$.
Using this,



"-1 -ifyfjiy^i 2 

<7i(/3i-l)Mwt") + ^ ^' 1^ E[L(w^-)+£(wf,20]+^?(w*) 



t=(i-l)6+l 



^E"^ E ^ 



t=(i+l)6+l 
n-1 



VL(wr) - VLCwf'^) + V£(w^^^t) - V£(wr,2i) 



= 71 (/3i - l)L(w?«) + ^ '"^f^ ^^ E [L(w,^«)] + 



n— 1 . T-j'9 9 



t=(i+i)fe+i 



VL(wf ) - VL(wr) + V£(wr,0t) - V£(wf 
By $H$-smoothness of $L$ and $\ell$ we have that $\|\nabla L(w_i^{ag}) - \nabla L(w_i^{md})\|_* \le H \|w_i^{ag} - w_i^{md}\|$. Similarly we also have that
$\|\nabla \ell(w_i^{ag}, z_t) - \nabla \ell(w_i^{md}, z_t)\|_* \le H \|w_i^{ag} - w_i^{md}\|$. Hence,



< 71 - l)L(wr) + ^ ^^^^ ^' E[L(wf)] + ii(w*) 



^-1 QZJ2 7^2 2 



However, wr'^ 
Hence, 



V, + (1 - /3ri)w-. Hence llw- - ^T^? < t^^^ < n^^-^^f+^jl^^-^t^ < 



Dividing throughout by $\gamma_n(\beta_n - 1)$ concludes the proof.



□ 



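Proof of Theorem 4. First note that for any $i$,

\[
2 H \gamma_i = 2 H \gamma i^p \le \frac{i^p}{2} \le \beta_i .
\]

Also note that since $p \in [0, 1]$,

\[
\gamma_{i+1}(\beta_{i+1} - 1) = \gamma \frac{(i+1)^p\, i}{2} \le \gamma \frac{i^p (i+1)}{2} = \gamma_i \beta_i .
\]

Thus we have verified that the step sizes satisfy the conditions required by the previous lemma. From the
previous lemma we have that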
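The two step-size conditions of Lemma B.1 can also be checked mechanically. The sketch below assumes the schedule $\beta_i = (i+1)/2$ and $\gamma_i = \gamma i^p$ with $\gamma \le 1/(4H)$; the schedule, the value of $H$ and the helper names are assumptions made for illustration.

```python
def step_sizes(i, gamma, p):
    """Assumed schedule: beta_i = (i + 1) / 2 and gamma_i = gamma * i**p."""
    return (i + 1) / 2.0, gamma * i ** p

def conditions_hold(n, H, gamma, p, tol=1e-12):
    """Check beta_1 = 1, 0 < gamma_{i+1}(beta_{i+1} - 1) <= beta_i * gamma_i,
    and 2 * H * gamma_i <= beta_i for i = 1, ..., n."""
    if step_sizes(1, gamma, p)[0] != 1.0:
        return False
    for i in range(1, n + 1):
        beta_i, gamma_i = step_sizes(i, gamma, p)
        beta_next, gamma_next = step_sizes(i + 1, gamma, p)
        lhs = gamma_next * (beta_next - 1.0)
        # Small relative tolerance: for p = 1 the first condition is tight.
        if not (0.0 < lhs <= beta_i * gamma_i * (1.0 + tol)):
            return False
        if 2.0 * H * gamma_i > beta_i * (1.0 + tol):
            return False
    return True

H = 3.0
results = [conditions_hold(200, H, gamma=1.0 / (4.0 * H), p=p)
           for p in (0.0, 0.25, 0.5, 1.0)]
violated = conditions_hold(200, H, gamma=1.0 / H, p=1.0)  # gamma too large
```

With $\gamma = 1/H$ the condition $2H\gamma_i \le \beta_i$ already fails at $i = 1$, which is why the cap $\gamma \le 1/(4H)$ matters.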

" ^ ^ - 7n(/3n - 1) ^ ^ 67„(/3„ - 1) -1) &7n(/3„ - 1) ^ ft' 

\mP{n - 1) ^ L I i + ^„p(„ _ 1) + ;,„p(„ _ 1) ^ (i + l)^ 



< 



^^^^!l(!Lzl)!! VFfn ag.. , 2i?(w*) 256g^ j^^Z?^7 1 
6nP(n - 1) ^ ^ ^{n - 6(n - ^ i2(i-p) 

- 6(n - ly-P ^ ^ ^ * 7(n - 6(n - ^ i2(i-p) 



6(n-l)i-P^' L V » /J V ' ^ 7(n-l)P+i 

256i72/s:22^2^ 
&(n- 1) 
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since 7 < l/4iJ, 



b{n - 1) 
Thus we have shown that 



+ 



^ 6(n - 1) 

Now if we use the notation ai=E [L(wf )] - L{w*), A{i) = ^t-iv-v and 



b{i-iy-v 

6AHK'^^L{w*){i-l)P 2i?(w*) GAHK^D^ 
b ^ j{i - ^ 6(« - 1) 



Note that for any i by smoothness, < Lq := ^HD^ + Z/(w*) Also notice that 



i=n-M-l i=n-M-l ^ ' 

Hence as long as 

^ - QAHK^nP ' ^^^^ 

Sr=n-M-i ^(*) — 1- shall ensure that the 7 we choose will satisfy the above condition. Now applying 
lemma B.3 we get that for any M, 



a„<e^(n) ao(n-M)+ ^ B(i)\ + B{n) (12) 

V i=n-M-l ) 

Now notice that 

_ 64gj^^7L(w*) ^ 1 2J?(w*) ^ 1 64HK^D^ ^ 

i=n-M-l 



b ^ (i-l)P 7 ^ b ^ (i-1) 

i=ra-M-l ^ ' i=„-M-l ^ i=n-M-l ^ ^ 

6AHK'^jL{w*){n- M -2)P 2R{w*) GAHK^D'^ 64HK^'yL{w*){n ~ 1)p+'^ 

- b ^ -f{n-M- 2)P+i b{n-M- 2) 6 
2i?(w*) 64iJJs:2£)2iog „ 
7(n - M - 2)P ^ 6 
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Plugging this back in Equation 12 we conclude that 

an<- ^(Lo(n-M) + L^_A )_ ^ V 



-1 ^ 



b{n - M - 2) ^ 6 7(n - M- 2)P 6 



6 7(n - b{n - 1) 

64eHK^-f ( 64HK^D^ 4i?(w*) 64iJii'2£)2 log(n) 

- 6(n - ~ ^ ~ + - M - 2) + - M - 2)P + 6 

256HK^jL{vr*){n-l)P+^\ 6AHK^jL{y/*){n-l)P 4i?(w*) 64ifi^2£,2 
b ) 6 7(n - 6(n - 1) 

biuce 7 ^ eiHK^nP 6(n-M-2) - 7(n-M-2)P ' 

GAeHK^-f / , ,^ , 6i?(w*) 64fl'ii'2£>2 log n 

- b{n - ly-P l"^"^" " + 7(" - - 2)P + ^ 

_^ 256iJi4:27L(w*)(n- 64ifi4:27L(w*)(n- 1)P ^ 4i?(w^) ^ 64HK^D^ 



b J b 7(n- 1) 

We now optimize over the choice of M above by using 

'6i?(w*)\ ^ 



(n - M - 2) 



Of course, for the choice of $M$ to be valid we need that $n - M - 2 \le n$, which gives our second condition on $\gamma$,
which is

7>^ (13) 

Plugging in this M we get, 

< T7~ iM_„ 2Z/0 H 1 



b{n - ly-P \ ^ \ 7 

64HK'^7L{-w*){n-l)P 2R{vr*) GAHK'^D^ 
^ b 7(n - ^ b{n - 1) 

USeHK'^ji^Lf^ {6R{^v*))^ , 2e{64HK^-ffL{w*){n-l)^P , 2i?(w*) 



+ 



b{n-iy-P 62 7(n- 

2e{64:H K^^D^-f log n 64:HK^7L{w*){n - ly QAHK^D^ 
62(n- ^ b 6(n- 1) 

however by condition in Equation 11, 7 < gjg^^s^, hence 

348i?ii'27?TiL^ (6i?(w*))5TT 2e(64iJif 2)21)2^ log n 

— TT -< \ 1 l~ 



6(n-l)i-P 62(„_i)i- 



348gj<:27L(w*)(n- 1)P 2j;(w*) GAHK^D^ 
^ 6 +7(n-l)P+i + ^ ^ 
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We shall try to now optimize the above bound w.r.t. 7, To this end set 



1 / bR{w*) / b y^+i /6i?(w*)yf+i 



We first need to verify that this choice of 7 satisfies the conditions in Equation 11 and 13. To this end, note 
that as for the condition in Equation 11, 



7< 



lOAAHK'^in-iyp 



/ 6i?(vir*)yp+i 



and hence it can be easily verified that for n > 3, 7 < gjff^s^- On the other hand to verify the condition 
in Equation 13, we need to show that 



7 = mm • 



AH' V 174:HK'^L{Mv*){n - l)2p+i ' V 10UHK^{n - l)2p 



p+i 



6i?(w*) / b y^+i /6i?(w*)yf+i 



6i?(w*) 
- nP+1 (Lo) 

It can be verified that this condition is satisfied as long as, 

n > max < 3, ; , — ; > 

I b b } 

So in effect as long as n > 3 and sample size nb > max{783i4r2, ^"^^f/^^ ^ } the conditions are satisfied. Now 
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plugging in this choice of 7 into the bound in Equation 14, we get 



p+i 

2p+l 



- Y 6(n - 1) + 3 6(n-l) ) + 6(n - 1) 



(n- ''^ ' \bin-l) J \ Lo 



p+i 

2p + l 



^ /2784ffif2^(w*)L(w*) _^ 2 / 6264i?X2^(w*)Lo^+' \ _^ GiHK^D'^ 



6(n- 1) 3 I 1) / 6(n - 1) 

8HR(w*) , , /64i/ii'2\ST^ /6i?(w*)\5i^FT 

+ 7 TwIT + los(") ( ^ M 



p+i 

2p+l 



2784:HK^R{w*)L{w*) 2 / 6264iri^2^(w*)L|+^ \ QAHK^D^ 
-y 6(n- 1) "^3\^ &(n - 1) j 6(n - 1) 

8HR{^*) / (96X2)f+i D^y^' /64i/if2^(w*)\ log(n) 



(n-l)P+i \^ i?(w*) y V J {b{n-l)) 



\ 2p + l 



< 



b{n-l) 1) U264i?i^2^(w*) j ~ 6(n - 1) 



^ 8HR{w*) ^ f64:HK^D^\ log(n) /^384ffi4:2i?(w*) y^+i 



^ b{n-l) J (b{n-l))^ V 



< 



l2784:HK^R{w*)L{w*) Al7mK^R{yv*) f Lq V"""' r?,r ^^\^\^ MHK^D^ 



8HR{w*) /64HK^\ log(n) 
+ (n-l)f+i + I 6(n-l) J (6(n-l))^ 

Picking

\[
p = \min\left\{ \max\left\{ \frac{\log(b)}{2 \log(n-1)},\; \frac{\log\log(n)}{2\left( \log(b(n-1)) - \log\log(n) \right)} \right\},\; 1 \right\}
\]

we get the bound,

j2784HK'^R{w*)L{w*) ( 4176H K'^ R{w*) 4:176HK^R{w*)^yiog{n)\ / Lq 
- Y 6K^1) + 1^ y/b{n-l) ^ Hn^) ) [6264HK^R{w*) 

120HK'^D^ 8ffi?(w*) 8HR{yv*) GAH ^\og{n) 
^ 6(n-l) + (n-l)2 



^ 2784HK^R{w*)L{w*) ^ I m6HK^R{w*) _^ A176HK'^R{w*)y/log{n) \ ( Lq 



b{n-l) \ yfe(n-l) j \Q2MHK'^R{w*) 

UOHK^D^ 8HR{w*) 8HR{w*) QAH y^\og{n) 
^ 6(n-l) ^ (n-l)2 &(" - 1) 



2784:HK^R{w*)L{w*) 4:54:{HK^R{w*))^/'^L§ 454{HK^R{w*)f/'-^L^ ^/[og(nj 
- V 6(n - 1) ^ - 1) ^ 



_^ 120i?if2£)2 ^ s,HR{w*) _^ 8HR{w*) _^ 64iIiV'2£)2yiog(n) 



6(n-l) (n-l)2 ^(n-1) b{n-l) 
Recall that io = f-ff-D^ + i(w*). Now note that if L(w*) < HK^D^/2 then Lq < 2HK'^D'^, on the other 
hand if i(w*) > HK^D^/2 then {HK'^R{w*)f/^L^ < ^AHK'^R{w*)L{mv*). Hence we can conclude that, 



2784HK^R{-w*)L{w*) 454HK'^{R{-w*)f^^{2D'^)i 454ifii'2(i?(w*))2/3(2D2)5yiog(n) 

120HK^D'^ 8HR{w*) 8HR{yv*) 64H ^/k^g{nj 908^/HK^R{w*)L{w*) 
^ b{n-l) + (n-l)2 +y6(n-l)^ 6(n - 1) ^ x/6(n - 1) 

908^yHK^R{w*)L{w*) log(n) 
^ 6(n- 1) 

Since n > 783ii'^ and i?(w*) < D'^/2 we can conclude that 



HK^R{w*)L{xv*) 580HK^{R{w*)f/^D^ M5H ^log{n) 8HR{w*) 
- ^^^Y H!^) ^ Vb{n-1) ^ K^^"^!) ^ (n-l)2 

This concludes the proof. □ 

Proof of Theorem 2. For the Euclidean case, $R(w) = \frac{1}{2}\|w\|_2^2$ and $K = \sqrt{\sup_{w : \|w\|_2 \le 1} \|w\|_2^2} = 1$. Plugging these
into the previous theorem (along with the appropriate step size) we get



E[i:(w„ )]-L(w)<116y + + ^^^^-^^ +7n^ 

The second inequality is a direct consequence of the fact that $\|w^*\| \le D$. □



B.3 Some Technical Lemmas 



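Lemma B.2. Denote $K := \sqrt{2 \sup_{w : \|w\| \le 1} R(w)}$; then for any $x_1, \ldots, x_b$ mean zero vectors drawn i.i.d. from
any fixed distribution,

\[
\mathbb{E}\left[ \Big\| \frac{1}{b}\sum_{t=1}^{b} x_t \Big\|_*^2 \right] \le \frac{K^2}{b^2}\sum_{t=1}^{b} \mathbb{E}\left[ \|x_t\|_*^2 \right] .
\]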
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Proof. We start by noting that 



1 ' 



sup 

1 w:|| wll <1 



= inf - sup ( w, - V xt ) 
V " a;w:||w|l<i \ ^t^i I ) 

< I inf J- sup ii(w) + -i?* I " Vxt 
\ " [a w:||w||<i a \b 'fr{ 



« I 2a a \ b ^ 



(16) 



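where the step before last was due to the Fenchel-Young inequality and $R^*$ is simply the convex conjugate of $R$.
Now for any $i \in [b]$ define $S_i = R^*\left( \frac{\alpha}{b}\sum_{t=1}^{i} x_t \right)$. We claim that

\[
\mathbb{E}[S_i] \le \mathbb{E}[S_{i-1}] + \frac{\alpha^2}{2 b^2}\, \mathbb{E}\left[ \|x_i\|_*^2 \right] .
\]

To see this note that since $R$ is 1-strongly convex w.r.t. $\|\cdot\|$, by duality $R^*$ is 1-strongly smooth w.r.t. $\|\cdot\|_*$
and so for any $i \in [b]$,

\[
S_i \le S_{i-1} + \left\langle \nabla R^*\Big( \frac{\alpha}{b}\sum_{t=1}^{i-1} x_t \Big),\, \frac{\alpha}{b}\, x_i \right\rangle + \frac{\alpha^2}{2 b^2}\, \|x_i\|_*^2 .
\]

Taking expectation w.r.t. $x_i$ and noting that $\mathbb{E}[x_i] = 0$ by assumption we see that

\[
\mathbb{E}_{x_i}[S_i] \le S_{i-1} + \frac{\alpha^2}{2 b^2}\, \mathbb{E}_{x_i}\left[ \|x_i\|_*^2 \right] .
\]

Taking expectation we get as claimed that:

\[
\mathbb{E}[S_i] \le \mathbb{E}[S_{i-1}] + \frac{\alpha^2}{2 b^2}\, \mathbb{E}\left[ \|x_i\|_*^2 \right] .
\]

Now using this above recursively (and noting that $S_0 = 0$) we conclude that

\[
\mathbb{E}[S_b] \le \frac{\alpha^2}{2 b^2}\sum_{t=1}^{b} \mathbb{E}\left[ \|x_t\|_*^2 \right] .
\]

Plugging this back in Equation 16 we get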



t=i 



E 



t=i 



< inf • 



inf • 



2a 262 



t=i 



X,; 



t=l 



^Ee[KIIS 

t=l 



□ 
Lemma B.3. Consider a sequence of non-negative numbers $a_1, \ldots, a_n \in [0, a_0]$ that satisfy

\[
a_n \le A(n) \sum_{i=1}^{n-1} a_i + B(n)
\]

where $A$ is decreasing in $n$. For such a sequence, for any $m \in [n]$, as long as $A(i) \le 1/2$ for any $i \ge n - m - 1$
and $\sum_{i=n-m-1}^{n} A(i) \le 1$, then

\[
a_n \le e A(n) \left( a_0 (n - m) + \sum_{i=n-m-1}^{n} B(i) \right) + B(n) .
\]

Proof. We shall unroll this recursion. Note that

\[
\begin{aligned}
a_n &\le A(n) \sum_{i=1}^{n-1} a_i + B(n) \\
&= A(n) \left( \sum_{i=1}^{n-2} a_i + a_{n-1} \right) + B(n) \\
&\le A(n) \left( \sum_{i=1}^{n-2} a_i + A(n-1) \sum_{i=1}^{n-2} a_i + B(n-1) \right) + B(n) \\
&= A(n) (1 + A(n-1)) \sum_{i=1}^{n-2} a_i + B(n) + A(n) B(n-1) \\
&\le A(n) (1 + A(n-1)) \left( \sum_{i=1}^{n-3} a_i + A(n-2) \sum_{i=1}^{n-3} a_i + B(n-2) \right) + B(n) + A(n) B(n-1) \\
&= A(n) (1 + A(n-1)) (1 + A(n-2)) \sum_{i=1}^{n-3} a_i + B(n) + A(n) B(n-1) + A(n) (1 + A(n-1)) B(n-2) .
\end{aligned}
\]

Continuing so up to $m$ steps we get

\[
a_n \le A(n) \left( \prod_{i=1}^{m-1} (1 + A(n-i)) \right) \sum_{j=1}^{n-m} a_j + B(n) + A(n) \sum_{i=1}^{m-1} \left( \prod_{j=1}^{i-1} (1 + A(n-j)) \right) B(n-i) \qquad (17)
\]

We would now like to bound in general the term $\prod_{i=1}^{m-1} (1 + A(n-i))$. To this extent note that,

\[
\prod_{i=1}^{m-1} (1 + A(n-i)) = \exp\left( \sum_{i=1}^{m-1} \log(1 + A(n-i)) \right) .
\]

Now assume $A(i) \le 1/2$ for all $i \ge n - m - 1$ so that $\log(1 + A(n-i)) \le A(n-i)$. We get

\[
\prod_{i=1}^{m-1} (1 + A(n-i)) \le \exp\left( \sum_{i=1}^{m-1} A(n-i) \right) .
\]

Now if $\sum_{i=n-m-1}^{n} A(i) \le 1$ then we can conclude that

\[
\prod_{i=1}^{m-1} (1 + A(n-i)) \le e .
\]

Plugging this in Equation 17 we get

\[
\begin{aligned}
a_n &\le e A(n) \left( \sum_{i=1}^{n-m} a_i + \sum_{i=1}^{m-1} B(n-i) \right) + B(n) \\
&= e A(n) \left( \sum_{i=1}^{n-m} a_i + \sum_{i=n-m+1}^{n-1} B(i) \right) + B(n) .
\end{aligned}
\]

Now if for each $i \le n$, $a_i \le a_0$ then we see that (enlarging the range of the sum over $B$, which is valid when the $B(i)$ are non-negative)

\[
a_n \le e A(n) \left( a_0 (n - m) + \sum_{i=n-m-1}^{n} B(i) \right) + B(n) .
\]

Hence we conclude that as long as $\sum_{i=n-m-1}^{n} A(i) \le 1$,

\[
a_n \le e A(n) \left( a_0 (n - m) + \sum_{i=n-m-1}^{n} B(i) \right) + B(n) .
\]

□
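The unrolling argument can be exercised on a concrete sequence. The sketch below builds a sequence that satisfies the recursion of Lemma B.3 with equality and compares $a_n$ with the stated bound. The particular choices $A(i) = 1/(i^2 + 10)$, $B(i) = 1/i$, $n = 40$ and $m = 10$ are hypothetical and serve only as an illustration.

```python
import math

def check_lemma_b3(n, m, A, B, a1):
    """Build a_k = A(k) * sum_{i<k} a_i + B(k) (the recursion with equality)
    and compare a_n against the bound of Lemma B.3."""
    a = [None, a1]  # 1-indexed: a[1] = a1
    for k in range(2, n + 1):
        a.append(A(k) * sum(a[1:k]) + B(k))
    a0 = max(a[1:])  # upper bound on the whole sequence
    window = range(n - m - 1, n + 1)
    # Preconditions of the lemma.
    assert all(A(i) <= 0.5 for i in window)
    assert sum(A(i) for i in window) <= 1.0
    bound = math.e * A(n) * (a0 * (n - m) + sum(B(i) for i in window)) + B(n)
    return a[n], bound

a_n, bound = check_lemma_b3(
    n=40, m=10,
    A=lambda i: 1.0 / (i * i + 10.0),  # decreasing in i
    B=lambda i: 1.0 / i,               # non-negative
    a1=1.0,
)
```

Here `a0` is taken to be the largest element of the sequence, which is the smallest value compatible with the assumption $a_i \in [0, a_0]$.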